Abstract. Keywords in scientific articles are important for information filtering and classification. In this article, we empirically investigate the statistical characteristics and evolutionary properties of keywords in a well-known journal, the Proceedings of the National Academy of Sciences of the United States of America (PNAS), including the frequency distribution, the temporal scaling behavior, and the decay factor. The empirical results indicate that the keyword frequency in PNAS approximately follows Zipf's law with exponent 0.86. In addition, there is a power-law correlation between the cumulative number of distinct keywords and the cumulative number of keyword occurrences. Extensive empirical analysis of data from several other journals is also presented, in which the decaying trends of the most popular keywords are monitored. Interestingly, top journals from various subjects share a very similar decaying tendency, while journals with low impact factors exhibit completely different behavior. These empirical characteristics may shed some light on the in-depth understanding of semantic evolutionary behaviors. In addition, the analysis of keyword-based systems is helpful for the design of corresponding recommender systems.

PACS. 89.75.-k Complex systems - 05.65.+b Self-organized systems

1 Introduction

The study of semantics has a long history since its birth with Breal in 1893. It has been regarded as a branch of glossology. The modern semantic theory begins with the book Course in General Linguistics, authored by Saussure [1]. As pointed out by Graemes [2], semantics does not aim at describing every word in the natural language, but at establishing the foundation of a descriptive meta-language, according to which we can record and unify the procedure of content description.

The traditional semasiology analyzes the evolutionary properties of the acceptation mainly from the historical viewpoint, whereas the modern theory extends the horizon to the selection of new expressions, the existence and vanishing of phrases, the systematicness of acceptation and the meaning of sentences. Recently, as a new interdisciplinary issue, semiotic dynamics has attracted more and more attention from different scientific communities. Compared with traditional glossology and semasiology, semiotic dynamics treats words and morphemes as the basic units of content, and focuses on understanding how our communication patterns affect the human semantic system, as well as the underlying mechanisms of evolution, emergence, self-organization and self-adaptation of the semantic system [3,4,5]. Therefore, semiotic dynamics not only extends the research scope of traditional semasiology, but also contributes to the understanding of the characteristics of the human language system, including its evolving properties, the competition between different terms, the birth and fashion of new words, and so on [6,7].

The first step in the study of semiotic dynamics is to extract representative morphemes, such as the tags and keywords of a text, and to find out their relations. The mainstream methods include the Vector Space Model (VSM) [8,9,10,11] and the Ontology-Based Model (OBM) [14,15]. VSM is an algebraic model that describes text documents as vectors of identifiers. In VSM, a document is represented as a vector, and each dimension corresponds to a separate term. Several methods have been developed to calculate the values, and one of the best known is the TF-IDF weighting [13]. The weight vector for document d can be defined as:

$V_d = [w_{1,d}, w_{2,d}, \ldots, w_{N,d}]^T$, (1)

where

$w_{t,d} = \mathrm{tf}_t \cdot \log \frac{|D|}{|\{t \in d\}|}$. (2)

Here $\mathrm{tf}_t$ is the frequency of term t in document d, $|D|$ is the total number of documents, and $|\{t \in d\}|$ is the number of documents that contain the term t. The online recommendation system Fab [13] is a typical application of VSM. However, VSM neglects the semantic content, and thus its accuracy is sensitive to the word-segmentation algorithm.
By comparison, OBM uses ontologies to describe the relationships between terms. An ontology is a set consisting of abstractions, concepts and relations with which we wish to conceptualize the target world. The most typical kind of ontology on the web has a taxonomy and a set of inference rules. The taxonomy defines the classes of terms and the relations among them, while the inference rules make the terms more useful and meaningful to users [17]. WordNet [18], an ontology-based lexical database, is a generic ontology that is free for research purposes. OBM also has many limitations, since the relations between morphemes cannot be changed after a domain ontology has been defined [18,13]. In addition, the keywords with special functions in a text are usually confined to a previously defined set of words, which is generally updated more slowly than the frontier of the corresponding subject advances. For instance, the articles in Physical Review (A-E, L) are labeled by PACS numbers, which can only be selected from a standard set. The analysis of these kinds of semantic systems can partly exhibit the correlations and statistical evolutionary properties of keywords [19,20]. However, the establishment of this set of words involves strong external disturbances, which hinder the understanding of the essential evolving properties driven by the semantic system itself.
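As an illustration of the TF-IDF weighting in Eq. (2), the following minimal Python sketch computes the weight vectors for a small toy corpus. The function name tf_idf_vectors, the token-list input format and the three toy documents are assumptions made for this example only; they are not part of the analysis in this paper.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return one {term: weight} dict per document, following Eq. (2).

    `documents` is a list of token lists (a hypothetical input format).
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in documents:
        tf = Counter(tokens)  # raw frequency of each term in this document
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        })
    return vectors


# Toy corpus (hypothetical data).
docs = [
    ["protein", "expression", "gene"],
    ["gene", "expression", "network"],
    ["network", "scaling", "scaling"],
]
vectors = tf_idf_vectors(docs)
```

A term that occurs in every document receives weight zero, since log(|D|/|D|) = 0, which is why very common words contribute little to the vector.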

There are many ways to identify the semantic characteristics of academic articles, which contain plentiful language signs. Among them, the keywords, being carefully selected by authors and/or editors, can properly represent the main content of the corresponding article. Hence the semantic analysis of keywords can not only avoid the above limitations of VSM and OBM, but also shed some light on the in-depth understanding of the macroscopic evolutionary properties of scientific activities. In this paper, based on the data of a well-known scientific journal, we investigate the frequency distribution, the temporal scaling behavior, and the decay factor of keywords.

The rest of this paper is organized as follows. In Section 2, we introduce the empirical data. Section 3 shows the empirical results, including the Zipf's plot of keyword frequency, the temporal scaling behavior, and the decay factor. Finally, we summarize our findings and outline some open problems on related topics in Section 4.

2 Data 

Fig. 1. The frequency of keywords in PNAS follows a Zipf's law with exponent 0.86 ± 0.01.

Fig. 2. The scaling relation between t and N(t). The 16 points, from left to right, represent the cumulative data. That is to say, the leftmost point corresponds to the cumulative value up to the year 1991, the second point from the left denotes the cumulative value up to the year 1992, etc.

In order to ensure the authority and representativeness of our empirical analysis, we choose the Proceedings of the National Academy of Sciences of the United States of America (PNAS), one of the best-known scientific journals in the world. PNAS was founded in 1915, with one volume per year. It publishes original research articles and reports important academic activities. We have applied a Java script program to automatically download the keywords of each article in PNAS from the Web of Science. Since the articles published from 1915 to 1990 do not have keywords, our analysis is limited to the data collected from 1991 to 2006 (documents without keywords, such as Corrections and Additions, are not considered in our analysis), which consist of 46149 articles and 466470 keywords. These keywords combine two parts: those added by authors, and those proposed by editors (namely Keywords Plus). Note that some keywords are very popular and have been used in many articles; thus the number of distinct keywords, 102992, is much smaller than the number of keyword occurrences (i.e., 466470). Hereinafter, when referring to the number of keywords, we mean the total number of keyword occurrences. For example, if there are two articles, one with keywords A, B and C and the other with keywords C and D, then we say there are 5 keywords and 4 distinct keywords. Data from some other journals are also analyzed for comparison (see below). To be comparable, we also extract their data from 1991 to 2006.
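The counting convention used in this section (keyword occurrences versus distinct keywords) can be made concrete with a short Python sketch. The function name keyword_counts and the list-of-lists input format are assumptions for illustration; the two toy articles are the ones from the example in the text.

```python
from collections import Counter


def keyword_counts(articles):
    """Return (total occurrences, number of distinct keywords, per-keyword counts).

    `articles` is a list of keyword lists, one list per article.
    """
    counts = Counter(kw for keywords in articles for kw in keywords)
    total = sum(counts.values())   # "number of keywords" in the text
    distinct = len(counts)         # "number of distinct keywords"
    return total, distinct, counts


# The two articles from the example: one with A, B, C and one with C, D.
articles = [["A", "B", "C"], ["C", "D"]]
total, distinct, counts = keyword_counts(articles)
```

Here total counts the shared keyword C twice, while distinct counts it once, which is exactly the difference between the two quantities.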

3 Statistical Analysis 

3.1 Zipf's law of keywords' occurrences 

In the 1930s, Zipf put forth a famous law for the frequency distribution of phrases, namely Zipf's law [21], which has been widely used to characterize the distributions of firm size [22,23], city scale [24], wealth [25,26], earthquake strength [27], and so on. Ranking the phrases in descending order of their frequency of occurrence in a text, Zipf found a power-law relation between the rank, n, and the corresponding frequency, P_n:

$P_n \sim n^{-\alpha}$. (3)

As shown in Fig. 1, the frequency distribution of keywords in PNAS approximately follows Zipf's law with exponent α = 0.86 over four orders of magnitude. Most keywords have low frequencies, while a few popular keywords appear very frequently. Up to 2006, the most popular keyword, Expression, has been used 6927 times. Meanwhile, 66782 (64.84%) distinct keywords have been used only once. As shown in Table 1, this Zipf's law holds for scientific journals from a variety of subjects.
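A Zipf's plot such as the one described in this subsection can be produced by ranking keywords by frequency and estimating the exponent from the slope in log-log coordinates. The sketch below is one simple way to do this, assuming an ordinary least-squares fit; the fitting procedure actually used for Fig. 1 is not specified in the text, and the function names and the synthetic data are hypothetical.

```python
import math
from collections import Counter


def rank_frequency(counts):
    """Return [(rank, frequency)] in descending order of frequency; ranks start at 1."""
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def fit_loglog_slope(xs, ys):
    """Least-squares slope and intercept of log10(y) versus log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic check: frequencies generated exactly from P_n = 1000 * n^(-0.86).
ranks = list(range(1, 201))
freqs = [1000 * n ** -0.86 for n in ranks]
slope, _ = fit_loglog_slope(ranks, freqs)
alpha = -slope  # Zipf exponent in Eq. (3)
```

On real keyword data, rank_frequency would be applied to the per-keyword counts first, and the resulting ranks and frequencies passed to fit_loglog_slope. A least-squares fit on log-log axes is only a rough estimator for heavy-tailed data, so the recovered exponent should be read as indicative.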

3.2 Scaling between the number of distinct keywords 
and the total number of keywords 

Fig. 3. (Color online) The decay factor r_t as a function of time (year resolution) for different journals. The inset compares PNAS and several local journals with much lower impact factors. The full titles of the journals can be found in Table 1.

A keyword in a new publication is either a new one or has appeared in a prior article. Denote by t the cumulative number of keywords, and by N(t) the corresponding cumulative number of distinct keywords. Figure 2 presents a power-law relation between t and N(t) during the evolving process from the year 1991 to 2006. The dashed line, with slope 0.750 ± 0.007, is the linear fit of the data, that is to say,

$N(t) \sim c t^{\lambda}$, (4)

where λ ≈ 0.75 and c is a constant. From Eq. (4), one can derive that the growth rate of distinct keywords is $dN/dt = c \lambda t^{\lambda - 1}$, where t is the number of keywords. When λ = 1, there exists a linear relation between the number of newly added distinct keywords and that of newly added keywords, and the growth rate is a constant c. When λ < 1, the growth rate of distinct keywords decreases as the total number of keywords increases. Actually, if the number of distinct keywords is N, substituting $t = (N/c)^{1/\lambda}$ into the growth rate shows that the probability that the next keyword has not been used before (i.e., is distinct) is equal to $c^{1/\lambda} \lambda N^{1 - 1/\lambda}$, which decreases with increasing N when λ < 1. The data from some other scientific journals indicate the universality of this scaling law (see Table 1).
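The cumulative quantities t and N(t) of Eq. (4) can be computed from yearly keyword lists as sketched below, together with the probability that the next keyword is new. The function names, the dictionary input format and the toy data are assumptions made for illustration and do not come from the PNAS data set.

```python
def cumulative_growth(keywords_by_year):
    """Return [(t, N)] at the end of each year.

    t is the cumulative number of keyword occurrences and N the cumulative
    number of distinct keywords. `keywords_by_year` maps a year to the list
    of all keyword occurrences published in that year.
    """
    seen = set()
    t = 0
    points = []
    for year in sorted(keywords_by_year):
        for kw in keywords_by_year[year]:
            t += 1
            seen.add(kw)
        points.append((t, len(seen)))
    return points


def new_keyword_probability(n_distinct, c, lam):
    """Probability that the next keyword is new: c^(1/lam) * lam * N^(1 - 1/lam)."""
    return c ** (1 / lam) * lam * n_distinct ** (1 - 1 / lam)


# Toy data (hypothetical): three occurrences in 1991, two in 1992.
toy = {1991: ["a", "b", "a"], 1992: ["b", "c"]}
points = cumulative_growth(toy)
```

Fitting a straight line to log N versus log t over the yearly points gives an estimate of the exponent λ in Eq. (4).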



Table 1. Statistics of several journals from different subjects, including Appl. Phys. Lett. (APL), British J. Pharmacology (BJP), EMBO J. (EMBOJ), Annals of Neurology (AN), SIAM J. Appl. Math. (SIAM), Chin. Sci. Bull. (CSB), Czech. J. Phys. (CJP), J. Chem. Soc. Pakistan (JCSP). α is the exponent in the Zipf's plot and λ is the scaling exponent defined in Eq. (4). IF stands for the impact factor of the journal in 2007.

Journal Title   IF      α            λ
APL             3.596   1.01±0.01    0.683±0.008
BJP             3.767   0.92±0.01    0.753±0.006
PNAS            9.598   0.86±0.01    0.750±0.007
EMBOJ           8.662   0.86±0.01    0.753±0.003
AN              8.813   0.83±0.02    0.716±0.005
SIAM            1.026   0.58±0.02    0.825±0.005
CSB             0.77    0.51±0.01    0.857±0.013
CJP             0.423   0.48±0.01    0.912±0.002
JCSP            0.095   0.39±0.01    0.916±0.004

Surprisingly, some recent empirical studies demonstrate the extensive existence of this kind of scaling law, with the same form as Eq. (4), in web tag systems [28,29,30]. Note that a collaborative tagging system is an open and optional system in which each user can optionally modify the tags in the system. In contrast, the keywords in articles are considered seriously by the authors and editors, so keyword-based systems are more canonical and serious. However, both tags and keywords follow the same scaling law. This result indicates a possibly universal law for generic semantic systems.

3.3 Decaying behavior of the most popular keywords

The decay factor r_t of a keyword describes the collective decay of attention, which can be defined as [31]:

$r_t = \frac{\log N_t - \log N_{t-1}}{\log N_1 - \log N_0}$, (5)

where N_t denotes the cumulative frequency of occurrence of the monitored keyword at time t with year resolution, and N_0 is the frequency of occurrence in the first year (i.e., the year 1991). In order to reduce the fluctuation, when analyzing the decay factor we use the aggregated data of several keywords, so Eq. (5) is rewritten as:

$r_t = \frac{E(\log N_t) - E(\log N_{t-1})}{E(\log N_1) - E(\log N_0)}$, (6)

where E(·) denotes the average over the monitored set of keywords. We analyze the decay factor of the ten most popular keywords (top-10 keywords for short) in the year 1991. As shown in Fig. 3, r_t of PNAS (red circles) decays very fast in the first three years, and then slows down. The decay factor decreases to almost half in 1993. Actually, its decaying trend can be well fitted by an exponential function,

$y = A_1 e^{-x/t_1} + y_0$, (7)

where y_0 = 0.10 ± 0.01, A_1 = 1.47 ± 0.09, t_1 = 1.88 ± 0.13, and the time x varies from 1 (the year 1992) to 15 (the year 2006). The fitted curve is compared with the empirical result in Fig. 4. This decaying trend can be used to quantify the broadness of interests of a journal. For a journal with a high impact factor, it is possible and reasonable that r_t decays very fast in the early stage, since it mainly publishes the newest progress in natural science with some new concepts.

Fig. 4. The decaying trend of the top-10 keywords in the year 1991 for PNAS. The circles represent the empirical result obtained from Eq. (6), while the solid curve corresponds to the fitting function given in Eq. (7).

We also empirically study the decaying behavior of the top-10 keywords for several top journals in different subjects, from biology to mathematics. As shown in Fig. 3, all those decaying curves display a similar tendency. In contrast, as shown in the inset of Fig. 3, r_t of three local journals with relatively lower scientific impact has a far different shape from that of the top journals. Actually, their decay factor r_t exhibits large fluctuations, and no obvious decaying tendency can be observed even over a long period of time (15 years). A possible reason is that journals with low impact do not publish as much of the newest progress as top journals.
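The aggregated decay factor of Eq. (6) and the fitted curve of Eq. (7) can be sketched in Python as follows. The function names and the two toy count series are hypothetical; the default parameters of fitted_decay are the central values reported for PNAS in the text, with their uncertainties omitted.

```python
import math


def decay_factors(cumulative_counts):
    """Return [r_1, ..., r_T] following Eq. (6).

    `cumulative_counts` holds one list [N_0, N_1, ..., N_T] per monitored
    keyword, where N_0 is the first-year frequency and N_t the cumulative
    frequency up to year t (all values must be positive).
    """
    n_keywords = len(cumulative_counts)
    n_years = len(cumulative_counts[0])
    # E(log N_t): average of log N_t over the monitored keywords.
    mean_log = [
        sum(math.log(series[t]) for series in cumulative_counts) / n_keywords
        for t in range(n_years)
    ]
    denominator = mean_log[1] - mean_log[0]
    return [(mean_log[t] - mean_log[t - 1]) / denominator
            for t in range(1, n_years)]


def fitted_decay(x, a1=1.47, t1=1.88, y0=0.10):
    """Exponential fit of Eq. (7): y = A_1 * exp(-x / t_1) + y_0."""
    return a1 * math.exp(-x / t1) + y0


# Toy cumulative counts for two hypothetical keywords over four years.
toy_counts = [[100, 180, 240, 280], [50, 90, 120, 140]]
r = decay_factors(toy_counts)
```

By construction r_1 = 1, so the curve measures how quickly the yearly growth of log N_t shrinks relative to the growth in the first monitored interval.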

4 Conclusion and Discussion 

In this paper, we empirically investigated the statistical characteristics and the evolutionary properties of keywords in a well-known journal, namely PNAS, including the frequency distribution, the temporal scaling behavior, and the decay factor. Firstly, the empirical results indicate that the frequency distribution of keywords in PNAS approximately follows Zipf's law with exponent 0.86, which means that only a few keywords are used frequently in PNAS, whereas most keywords are used rarely. Secondly, there is a power-law correlation between the number of distinct keywords and the total number of keyword occurrences. We have also investigated the data of some other journals in different subjects, which strongly indicate the universality of these two statistical properties. In addition, we studied the decay factor of the most popular keywords. Interestingly, the top journals, though from far different subjects, exhibit very similar decaying behavior that can be approximately fitted by an exponential function. In contrast, the journals with lower impact factors exhibit very different behavior; in fact, no obvious decaying tendency is observed.

The study of systems with collaborative keywords is also relevant to recent progress on the design of recommender systems. Actually, with the advance of Web 2.0 techniques, a great number of recommendation algorithms have been applied to on-line resource-sharing systems [33], which can recommend music, films, books and news to users according to their historical activities. Up to now, the most accurate algorithm is content-based [31]. However, content-based methods are practical only if the items have well-defined attributes that can be extracted automatically. The traditional content-analysis approach, based on cutting the content word by word, is often impractical, since its computational complexity is too high for huge databases. In contrast, structure-based algorithms have lower complexity but also lower accuracy [35,36,37]. Because the keywords of an article can express, to some extent, the main content of this article, an algorithm with low complexity and high accuracy is expected from properly integrating the recommendations drawn from the keyword-article bipartite graph and the author-article bipartite graph (see Ref. [35] for how to get recommendations from a bipartite graph).

In addition, a Keyword-Based Collaboration Network (KBCN) can be constructed based on the definition that two keywords are connected if they appear together in at least one article. More characteristics of the structural organization of a keyword-based semantic system can be analyzed with the help of the KBCN (see Refs. [38,39,40] for how to construct and analyze collaboration networks). In particular, an in-depth understanding of the hierarchical organization [41], the community structure [42] and the motif density [13] is crucial for the classification of research areas and the evaluation of the strength of interdisciplinary studies.
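The KBCN definition above (two keywords are linked if they appear together in at least one article) translates directly into code. The following minimal sketch builds the network as adjacency sets; the function name build_kbcn and the toy articles are assumptions for illustration only.

```python
from itertools import combinations


def build_kbcn(articles):
    """Return {keyword: set of neighbouring keywords}.

    Two keywords are neighbours if they co-occur in at least one article.
    `articles` is a list of keyword lists, one list per article.
    """
    neighbors = {}
    for keywords in articles:
        unique = sorted(set(keywords))
        for kw in unique:
            # Keywords that never co-occur still appear as isolated nodes.
            neighbors.setdefault(kw, set())
        for a, b in combinations(unique, 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


# Toy data (hypothetical): two articles sharing the keyword C.
articles = [["A", "B", "C"], ["C", "D"]]
network = build_kbcn(articles)
degrees = {kw: len(nb) for kw, nb in network.items()}
```

In this toy network the shared keyword C bridges the two articles and has the largest degree. Structural measures such as community structure or motif density would then be computed on this adjacency structure.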